Medical Concept Embeddings via Labeled Background Corpora
Abstract
In recent years, interest in low-dimensional vector representations of words has grown steadily. Among other things, these representations facilitate computing word similarity and relatedness scores. The best-known algorithms for producing such representations are the word2vec approaches. In this paper, we investigate a new model to induce such vector spaces for medical concepts, based on a joint objective that exploits not only word co-occurrences but also manually labeled documents, as available from sources such as PubMed. Our extensive experimental analysis shows that our embeddings lead to significantly higher correlations with human similarity and relatedness assessments than previous work. Given the simplicity and versatility of vector representations, these findings suggest that our resource can easily be used as a drop-in replacement to improve any system that relies on medical concept similarity measures.
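As a rough illustration of the evaluation idea mentioned in the abstract, the sketch below scores concept pairs by the cosine similarity of their vectors and compares the resulting ranking with human ratings. Everything in it is an assumption made for illustration: the three-dimensional vectors, the concept names, and the human scores are invented toy values, and Spearman rank correlation is only one plausible choice of correlation measure, since the abstract does not say which one the authors used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """Ranks from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical toy embeddings, NOT taken from the paper's resource.
embeddings = {
    "aspirin": [0.9, 0.1, 0.0],
    "ibuprofen": [0.8, 0.2, 0.1],
    "headache": [0.4, 0.3, 0.8],
    "fracture": [0.0, 0.9, 0.4],
}

# Hypothetical concept pairs with made-up human similarity ratings.
pairs = [("aspirin", "ibuprofen"), ("aspirin", "headache"), ("aspirin", "fracture")]
human_scores = [3.8, 2.5, 1.0]

model_scores = [cosine_similarity(embeddings[a], embeddings[b]) for a, b in pairs]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation with human ratings: {rho:.2f}")
```

In practice one would load real concept vectors and a published similarity benchmark instead of these toy values; a library routine that handles tied ranks would also be preferable to the simplified `spearman` helper above.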
Similar resources
Weak Supervision for Semi-supervised Topic Modeling via Word Embeddings
Semi-supervised algorithms have been shown to improve the results of topic modeling when applied to unstructured text corpora. However, sufficient supervision is not always available. This paper proposes a new process, Weak+, suitable for use in semi-supervised topic modeling via matrix factorization, when limited supervision is available. This process uses word embeddings to provide additional...
Learning Bilingual Sentiment Word Embeddings for Cross-language Sentiment Classification
The sentiment classification performance relies on high-quality sentiment resources. However, these resources are imbalanced in different languages. Cross-language sentiment classification (CLSC) can leverage the rich resources in one language (source language) for sentiment classification in a resource-scarce language (target language). Bilingual embeddings could eliminate the semantic gap bet...
CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases
Extracting entities and relations for types of interest from text is important for understanding massive text corpora. Traditionally, systems of entity relation extraction have relied on human-annotated corpora for training and adopted an incremental pipeline. Such systems require additional human expertise to be ported to a new domain, and are vulnerable to errors cascading down the pipeline. ...
Assessing the Corpus Size vs. Similarity Trade-off for Word Embeddings in Clinical NLP
The proliferation of deep learning methods in natural language processing (NLP) and the large amounts of data they often require stands in stark contrast to the relatively data-poor clinical NLP domain. In particular, large text corpora are necessary to build high-quality word embeddings, yet often large corpora that are suitably representative of the target clinical data are unavailable. This ...
Analysis of Word Embeddings and Sequence Features for Clinical Information Extraction
This study investigates the use of unsupervised features derived from word embedding approaches and novel sequence representation approaches for improving clinical information extraction systems. Our results corroborate previous findings that indicate that the use of word embeddings significantly improves the effectiveness of concept extraction models; however, we further determine the influence...
Publication date: 2016